|
Lesson Three: What's next? (Search Engines and Web Indexes)

In this lesson we will discuss how search engines work in general terms, not all possible scenarios (or search algorithms!).

What is a search engine really and how does it work?

What we think of as a search engine is really a team effort. There are 3 "members" of the team -- a mechanism that identifies web pages to be included in the database, a mechanism that indexes the sites, and a searching mechanism with an interface, which scans for keywords within the index. Users search the index (and hence, the database or web documents) through a query box or a template. Documents in which the search terms occur are presented as "hits."

Although some facilities are beginning to incorporate "natural language" searching (searching by asking a question such as "Where are the doughnuts?"), most search tools retrieve "hits" or "matches" by seeking occurrences of your search terms within their database and by attempting to match the terms (converted to a "string" of data bits) against their index. Because the terms are converted to a digital string and matched literally, the search engine must somehow be instructed to include plurals and alternate forms of a term. (Note: although some search tools automatically include plurals, many do not. If you are interested in "dogs," search for "dog or dogs.")
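To see why plurals have to be requested explicitly, here is a minimal Python sketch of literal term matching. The two documents and the function names are made up for illustration; real search tools are far more elaborate, and some do handle plurals for you.

```python
def tokenize(text):
    """Lowercase the text and split it into words, dropping surrounding punctuation."""
    return [word.strip('.,;:!?"()') for word in text.lower().split()]


def search_any(documents, terms):
    """Return the names of documents containing at least one term, matched exactly."""
    wanted = {term.lower() for term in terms}
    return sorted(name for name, text in documents.items()
                  if wanted & set(tokenize(text)))


# Hypothetical two-document collection, for illustration only.
documents = {
    "kennel.html": "Our kennel boards dogs of every size.",
    "puppy.html": "How to train a dog to sit.",
}

print(search_any(documents, ["dog"]))          # ['puppy.html']
print(search_any(documents, ["dog", "dogs"]))  # ['kennel.html', 'puppy.html']
```

Searching for "dog" alone misses the page that only says "dogs," because the two strings are different; supplying both forms (the "dog or dogs" search) finds both pages.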
What's a 'bot?

A 'bot, otherwise known as an intelligent agent, spider, crawler, robot, or worm, is an automated device (software) which may be programmed to search for terms ("strings") matching certain criteria. In terms of web search engines, a 'bot identifies and notes the URLs of web pages to be included in the database. Later, another 'bot comes along and works on the interiors of the web documents, recording occurrences of words and their positions within the text. This information is used to create a huge index. 'Bots travel along the links of a web site; that is, they crawl or traverse from one hypertext link to another.

What's the index for?

The index is how the search engine locates the URLs which match your request. The web documents containing the query keywords are presented as a listing, which may include a brief summary of each site. A simple way to understand the index is to think of it as a computerized book index. To discover where a topic occurs in a book, we would look up the word in the index, which would indicate the page number(s) where the term occurs. Now imagine that every single word is included in the book index. A computerized version might be represented like this:
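As a rough illustration of an index in which every word is listed, here is a minimal Python sketch. The page names and their text are hypothetical, and an actual search engine index is much larger and more sophisticated; the sketch only shows the idea of recording each word together with the pages and positions where it occurs.

```python
from collections import defaultdict


def build_index(pages):
    """Map every word to the pages it occurs in and its word positions on each page."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for position, word in enumerate(text.lower().split()):
            index[word].setdefault(url, []).append(position)
    return index


def lookup(index, word):
    """Return the pages on which the word occurs, like looking up a book index entry."""
    return sorted(index.get(word.lower(), {}))


# Hypothetical pages, for illustration only.
pages = {
    "fruit.html": "apple pie and apple cider",
    "citrus.html": "orange juice and orange peel",
    "salad.html": "apple orange grape salad",
}

index = build_index(pages)
print(index["apple"])          # {'fruit.html': [0, 3], 'salad.html': [0]}
print(lookup(index, "grape"))  # ['salad.html']
```

The search engine never rereads the pages at query time; it looks the word up in the index, just as we would flip to the back of a book.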
How does a search engine decide how to list web sites matching my search terms?

Each search engine uses a different algorithm or method to calculate something called "relevance," which it then "ranks." Have you ever noticed the numbers which sometimes appear next to the URLs in a listing of search results? This is the "relevance ranking." Relevance means the probability that the "hit" or "match" is on-target with your query. The creators of search engines change the way they calculate relevance and do not tell us mere users their methodology; being high in the major search engines' rankings on a topic means big business. Unscrupulous folks "spam" the search engine to try to improve their rankings (and hence, their web-based business). So exactly how a search engine calculates relevance is protected, proprietary information.

Note: because each search engine assigns relevancy rankings differently, if you execute exactly the same search in several search engines, you will get different results in terms of how and where the URLs are listed (even if their database contents were identical).

In general, however, relevance is calculated by noting where the term occurs within the text and assigning this position a "weight" or level of importance. Terms occurring in the title, in the summary, in key positions within a paragraph, or several times within a paragraph usually carry more "weight" because there is a higher probability that terms in these positions indicate significant material on the topic. This is very similar to our book index example above; because the term apple occurs many times and in key positions (title, table of contents, beginning of paragraphs), there is a high probability that the document contains significant information about apple. Note that orange also occurs in the table of contents, an indication of the term's relative importance (it is a significant topic, but not as important as apple).
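Since real ranking methods are proprietary, the following Python sketch is only a toy version of the general idea of position-based weighting. The weights, field names, and documents are all assumptions made up for illustration; they are not any search engine's actual formula.

```python
# Assumed weights: a term in the title counts more than one in the summary or body.
FIELD_WEIGHTS = {"title": 5, "summary": 3, "body": 1}


def relevance(document, term):
    """Score a document by where, and how often, the term occurs in it."""
    term = term.lower()
    score = 0
    for field, weight in FIELD_WEIGHTS.items():
        words = document.get(field, "").lower().split()
        score += weight * words.count(term)
    return score


def rank(documents, term):
    """List matching documents from highest to lowest score, dropping non-matches."""
    scored = [(relevance(doc, term), name) for name, doc in documents.items()]
    ordered = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [name for score, name in ordered if score > 0]


# Hypothetical documents, for illustration only.
documents = {
    "orchard.html": {"title": "Apple orchards",
                     "summary": "Growing apple trees",
                     "body": "An apple a day"},
    "grocery.html": {"title": "Weekly grocery list",
                     "summary": "What to buy",
                     "body": "milk bread apple"},
    "bikes.html": {"title": "Bicycle repair",
                   "summary": "Fixing a flat",
                   "body": "tires and tubes"},
}

print(relevance(documents["orchard.html"], "apple"))  # 9
print(rank(documents, "apple"))  # ['orchard.html', 'grocery.html']
```

Both pages mention apple, but the one with the term in its title and summary scores higher and is listed first. Change the assumed weights and the order of results can change, which is one reason the same search gives different listings in different search engines.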
The algorithm of the search engine and the methodology it uses to calculate relevance emulate the observations and judgments we make based on our experience, but only roughly. A search engine will return our book index as a hit when the search terms apple and grape are requested, whereas a human might judge that although the two terms occur within the document, there is no significant relationship between them and the document is hence irrelevant.

Some search engines look only in certain fields to index documents, such as the title field, the first paragraph, and something called "meta-tags." Meta-tags allow the creator of a web site to add descriptive keywords which are not displayed in the actual web documents; they are there specifically to enhance retrieval of the document. As people "spam" the search engine (for example, by repeating terms over and over again), meta-tags are decreasing in importance because the folks that program the 'bots train them to overlook repetitions and other clues to "spamming."

What's the best search engine?

I'm sure I'm going to disappoint a lot of folks by giving the answer "the best search engine is the one that fits the task." Until you have some experience with knowledge seeking tools and, importantly, with identifying your real information need, it may be difficult to ascertain which tool is best for your purpose. (For example, a query on "Leonardo da Vinci's Mona Lisa" is likely to be more successful than "that lady with the smile by a Renaissance artist," and "dosage and usage guidelines for St. John's Wort" is likely to be more successful than "St. John's Wort.") But the good news is, you will make better choices with experience.

What do I use? Well, that depends....

Remember, I am a librarian in an academic (college) library, so I never know what the next information request will be (that's the fun part!). In practical terms this means that I am looking for information in a variety of places, which precludes having a standard game plan.

Here are a few of my search tactics/favorite tools:
What are simple ways to make my search more effective?

A very effective way to increase the relevance or precision of "hits" is to search as a phrase. In most cases this simply means putting quotation marks around the search terms. "Red socks" is a different search than red socks in most search engines. What you are actually doing by searching as a phrase is using the concept of proximity, which concerns the terms' physical closeness to one another. A document with red and socks occurring close or next to each other is more likely to be on target than a document with red in the title and socks buried in the text.

Another way to increase your search effectiveness is to be as specific as possible; that is, include as many terms and synonyms as you can think of to fully describe your topic. Instead of
try

Note: search utilities may not support the use of parentheses or nesting in basic searches, although many support them in their "advanced" searches.
What are the most popular and useful search utilities? (the "major" search engines)

OK folks. We are looking at a sampling of search engines and describing generalities; we are not attempting to create a definitive listing. For example, we'll be discussing meta search engines in Lesson 6, so you won't find them listed here.
Developed by Digital Equipment Corporation, Alta Vista searches the Web and Usenet. Its very large database supports both simple and advanced searching, with the ability to limit searches to select portions of web documents. For example, it is possible to limit searches to titles, domains, images and links within Web documents, and to particular newsgroups or subjects in Usenet. It also offers the ability to browse by subject (although this is rather slow).

Search site featuring a very large database and a lot of "extras" such as Excite Channels (a guide to sites by subject), stock quotes, news, TV and searching of newsgroups. Offers concept searching.

Voted no. 1 among search engines by PC Magazine, Hot Bot offers a sophisticated interface with a vast array of options such as searching by dates, by certain domains in the U.S. (e.g. .com, .org, .edu, .gov), and by media type (e.g. image, audio, video). It also has a huge database, powerful advanced searching options, access to other search tools by type, and a subject guide.
Specialized Search Engines and Collections:

Specialized search engines are most often programmed to "collect" web documents along a topical theme, for example the Arts, Science, Health-related topics, or even more specialized subjects such as Ancient History of the Mediterranean.

Also fitting in this category are "search tools" that really calculate rather than retrieve information (such as those fitting in the "distance between two points" or "salary differential" categories). Since it is impossible to list specific tools here, the following are sites which group or list subject specific search engines or tools:
(http://www.allonesearch.com/) Links to specialized search engines such as Career Mosaic, CIA World Factbook, How far is it? (distances between cities, with maps and driving directions), and Movies.

(http://www.beaucoup.com) Beaucoup is a collection of approximately 1000 search engines, directories and indices from all over the world, organized into categories such as: General Searchers, Reviewed Sites/What's New, Software, Reference, Education, Art/Graphics, Social/Environmental/Political Concerns, and Consumer Medicine. Good starting point for popular subjects.

(http://www.hamrad.com/search.html) Search for a search engine by subject, keyword or country.

(http://www.isleuth.com/) A good starting point to find a research oriented search facility. More than 3000 databases included.

For more information:

(http://www.monash.com/spidap.html)

(http://www.searchenginewatch.com)

Teaching Library Internet Workshops, Teaching Libraries of University of California, Berkeley (http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/Strategies.html)

Assignments

This week we are going on an Infoquest! Please find answers to the following questions using either a subject directory that we discussed in Lesson 2, or a search engine.

Remember -- there are many routes to the same information....
Evaluate the following "major" search engines:
Find a search facility that will help you find the following types of information. Please include a sample question/reason for inquiry.
Last updated: February 24, 1999, Links checked: February 24, 1999 |